ABSTRACT

Click through rates (CTR) offer useful user feedback that 
can be used to infer the relevance of search results for queries. 
However it is not very meaningful to look at the raw click 
through rate of a search result because the likelihood of a 
result being clicked depends not only on its relevance but 
also the position in which it is displayed. One model of the 
browsing behavior, the Examination Hypothesis [16] [5] [6],
states that each position has a certain probability of being 
examined and is then clicked based on the relevance of the 
search snippets. This is based on eye tracking studies [3]
[5] which suggest that users are less likely to view results in 
lower positions. Such a position dependent variation in the 
probability of examining a document is referred to as posi- 
tion bias. Our main observation in this study is that the 
position bias tends to differ with the kind of information 
the user is looking for. This makes the position bias query 
specific. 

In this study, we present a model for analyzing a query 
specific position bias from the click data and use these biases 
to derive position independent relevance values of search re- 
sults. Our model is based on the assumption that for a given 
query, the positional click through rate of a document is pro- 
portional to the product of its relevance and a query specific 
position bias. We compare our model with the vanilla exam- 
ination hypothesis model (EH) on a set of queries obtained 
from search logs of a commercial search engine. We also 
compare it with the User Browsing Model (UBM) [6] which
extends the cascade model of Craswell et al. [5] by incorporating
multiple clicks in a query session. We show that our
model, although much simpler to implement, consistently
outperforms both EH and UBM on widely used measures such
as relative error and cross entropy.

1. INTRODUCTION 

Click logs contain valuable user feedback that can be used 
to infer the relevance of search results for queries (see [1] [12]
[13] and references within). One important measure is the
click through rate of a search result, which is the fraction
of impressions of that result that end in clicks. However it is not
very meaningful to look at the raw click through rate of a 
search result because the likelihood of a result being clicked 
depends not only on its relevance but also the position in 
which it is displayed. One model of the browsing behavior, 
the Examination Hypothesis [16] [5] [6], states that each position
has a certain probability of being examined and is then
clicked based on the relevance of the search snippets. This 
is based on eye tracking studies [3] [8] which suggest that 
users are less likely to view results in lower positions. Such a
position dependent variation in the probability of examining 
a document is referred to as position bias. These position 
bias values can be used to correct the observed click through 
rates at different positions to obtain a better estimate of the 
relevance of the document^1. This raises the question of how
one should estimate the effect of the position bias. One 
method to estimate the position bias is to simply compute 
the aggregate click through rates in each position for a given 
query. Such curves typically show a decreasing click through 
rate from higher to lower positions except for, in some cases, 
a small increase at the last position on the result page. 

However, analyzing the click through rate curve aggre- 
gated over all queries may not be useful to estimate the 
position bias as these values may differ with each query. For 
example, Broder [2] classified queries into three main cat- 
egories, viz, informational, navigational, and transactional. 
An informational query reflects an intent to acquire some in- 
formation assumed to be present on one or more web pages. 
A navigational query, on the other hand, is issued with an 
immediate intent to reach a particular site. For example, 
the query cnn probably targets the site http://www.cnn.com
and hence can be deemed navigational. Moreover, the user 
expects this result to be shown in one of the top positions 
in the result page. On the other hand, a query like voice 
recognition could be used to target a good collection of 
sites on the subject and therefore the user is more inclined
to examine more results, including those in the lower positions on the
page. This behavior would naturally result in a navigational 
query having a different click through rate curve from an informational
query (see Figure 1). Further, this suggests that
the position bias depends on the query. 

It may be argued that the difference in the click through 

^1 We note that click through rate measures need to be combined
with other measures like dwell time, as the clicks reflect
the quality of the snippet rather than the document. Since
this study focuses on click through rates, we use the term
document interchangeably with the search snippet.
Figure 1: Click through rate curves over positions
1 through 10 for a navigational query and an informational
query. This shows that the click through rate drops
differently for different queries and suggests that the
examination probabilities for lower positions may
depend on the query.

rate curves for navigational and informational queries arises 
not from a difference in position bias, but due to the sharper 
drop in relevance of search results for navigational queries. 
In this study, we present a model for analyzing a query spe- 
cific position bias from the click data and use these biases 
to derive position independent relevance scores for search 
results. We note that our model, by allowing the examination
probability to be query specific, subsumes the case of query
independent position biases. Our work differs from earlier
works based on the Examination Hypothesis in that the position
bias parameter is allowed to be query dependent.

1.1 Contributions of this Study 

Our model is based on an extension of the Examination 
Hypothesis and states that for a given query, the click through 
rate of a document at a particular position is proportional 
to the product of its relevance (referred to as goodness) and 
query specific position bias. Based on this model, we learn 
the relevance and position bias parameters for different doc- 
uments and queries. We evaluate the accuracy of the pre- 
dicted CTR by comparing it with the CTR values predicted 
by the vanilla examination hypothesis and the user browsing 
model (UBM) of Dupret and Piwowarski [6]. 

We also conduct a cumulative analysis of the derived po- 
sition bias curves for the different queries and derive a single 
parametrized equation to capture the general shape of the 
position bias curve. The parameter value can then be used 
to determine the nature of the query as navigational or infor- 
mational. One of the primary drawbacks of any click-based 
approach for inferring relevance is the sparsity of the under- 
lying data as a large number of documents are never clicked 
for a query. We show how to address this issue by inferring 
the goodness values for unclicked documents through clicks 
on similar queries. 

2. RELATED WORK 

Several research works have exploited the use of user clicks 
as feedback in the context of ads and search results. Others 
have used clicks in conjunction with dwell time and other 
implicit measures. 



Radlinski and Joachims [15] propose a method to learn 
user preferences for search results by artificially introducing 
a small amount of randomization in the order of presentation 
of the results; their idea was to perform flips systematically 
in the system, until it converges to the correct ranking. In 
the context of search advertisements, Richardson et al. [17]
show how to estimate the CTRs for new ads by looking at the
number of clicks they receive in different positions. Similar to
our model, they assume the CTR is proportional to the prod- 
uct of the quality of the ad and a position bias. However, un- 
like our model, their position bias parameters are query inde- 
pendent. Joachims [T5] demonstrates how click logs can be 
used to produce training data in optimizing ranking SVMs. 
In another study based on user behavior, Joachims et al.
[13] suggest several rules for inferring user preferences on
search results based on click logs. For example, the rule
'CLICK > SKIP ABOVE' means that if a user skips several
search results and then clicks on a later result, the user's
preference for the clicked document should be interpreted as
greater than for those skipped above it. Agichtein et
al. [1] show how to combine click information based on similar
rules along with other user behavior parameters such as
dwell time and search result information such as snippets to 
predict user preferences. Our model, on the other hand, in- 
corporates the CTR values into a system of linear equations 
to obtain relevance and position bias parameters. Fox et al 
[7] study the relationship between implicit measures such as 
clicks, dwell time and explicit human judgments. Craswell 
et al. [5] evaluate several models for explaining the effect of
position bias on click through rate including one where the 
click through rate is proportional to the product of relevance 
and query independent position bias. They also propose a 
cascade model where the click through rate of a document 
at a position is discounted based on the presence of relevant 
documents in higher positions. Dupret and Piwowarski [6] 
present a variant of the cascade model to predict user clicks 
in the presence of position bias. Specifically, their model es- 
timates the probability of examination of a document given 
the rank of the document and the rank of the last clicked 
document. Guo et al. [9] propose a click chain model which
is based on the assumption that a document in position i
is examined depending on the relevance of the document in
position i - 1. We will briefly describe these click models
next. Before we do so, we note that the main difference
between our work and the earlier works based on the Examination
Hypothesis and the Cascade Model is that the position bias
parameter is allowed to be query dependent.

2.1 Current Click Models 

Two important click models, which have later been extended
in many works, are the examination
hypothesis [16] and the cascade model [5].

Examination Hypothesis: Richardson et al. [17] proposed this
model based on the simple assumption that clicks on documents
in different positions depend only on the relevance
of the document and the likelihood of examining a
document in that position. They assume that the probability
of examining a document at a position depends only
on the position and is independent of the query and the document.
Thus $c_q(d,j) = g_q(d)\,p(j)$, where $p(j)$ is the position
bias of position j.

Cascade Model: This model, proposed by Craswell et al [5], 
assumes that the user examines the search results top down 



and clicks on finding a relevant document. The probability
of clicking depends on the relevance of the document. This model
also assumes that the user stops scanning documents after
the first click in the query session. Thus, the probability
of a document d getting clicked in position j is
$c_q(d,j) = g_q(d) \prod_{k=1}^{j-1} (1 - g_q(d_k))$, where $d_k$ is the
document at rank k in the order presented to the user. Some
extensions of this model consider multiple clicks in
a query session.
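
The cascade formula can be made concrete with a small sketch. The function name `cascade_click_probability` and the list-based input are illustrative assumptions, not part of the original model description; the goodness values are made up.

```python
def cascade_click_probability(goodness, j):
    """Click probability at 1-based position j under the cascade model.

    goodness: list of g_q(d_k) values in the order shown to the user.
    The user reaches position j only if no earlier document was clicked.
    """
    prob_no_click_above = 1.0
    for g in goodness[: j - 1]:
        prob_no_click_above *= 1.0 - g
    return goodness[j - 1] * prob_no_click_above


# Hypothetical goodness values for three ranked documents.
example = [0.5, 0.4, 0.2]
probs = [cascade_click_probability(example, j) for j in (1, 2, 3)]
```

Here the third document has goodness 0.2 but is clicked far less often than that, because the two documents above it absorb most of the clicks.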

Dependent Click Model: The dependent click model (DCM)
proposed by Guo et al. [10] generalizes the Cascade Model
to multiple clicks. Once a document has been clicked, the
next position j may be examined with probability $\gamma(j)$.
Thus, in a user session, if $C_j$ is a binary variable indicating
the presence of a click on document $d_j$ at position j, then
$\Pr(C_j = 1 \mid C_{j-1} = 1) = g_q(d_j)\,\gamma(j)$ and
$\Pr(C_j = 1 \mid C_{j-1} = 0) = g_q(d_j)$.

User Browsing Model: The user browsing model (UBM)
proposed by Dupret and Piwowarski [6] is a variant of DCM
where the examination parameter $\gamma$ depends not only on the
current position but also on the position of the last click.
They assume that, given a sequence of click observations
$C_{1:j-1}$ in the preceding positions, the probability of examining
position j depends on the position j and the distance l of the
position j from the last clicked document (with a fixed convention
for l when no document is clicked), and is given by a parameter
$\gamma(j, l)$ that is independent of the query. Thus
$\Pr(C_j = 1 \mid C_{1:j-1}) = g_q(d_j)\,\gamma(j, l)$,
where l is the distance to the last clicked position.

Click Chain Model: This model, due to Guo et al. [9], is a
generalization of DCM that uses Bayesian inference to infer 
the posterior distribution of document relevance. Here if 
there is a click on the previous position, the probability that 
the next document is examined depends on the relevance of 
the document on the previous position. 

Our model is a simple variant of the Examination Hypothesis
where the position bias parameter $p(j)$ is allowed
to depend on the query q. Unlike most prior works, our
model allows for query specific position bias parameters.

3. PRELIMINARIES AND MODEL 

This study is based on the analysis of click logs of a com- 
mercial search engine. Such logs typically capture informa- 
tion like the most relevant results returned for a given query 
and the associated click information for a given set of returned
results. In the specific logs that we analyze, each
entry in the log has the following form: a query q, the top k
(typically equal to 10) documents $\mathcal{D}$, the click position j, and
the clicked document $d \in \mathcal{D}$. Such click data can be used to
obtain the aggregate number of clicks $a_q(d,j)$ on d in position
j and the number of impressions of document $d \in \mathcal{D}$ in
position j, denoted by $m_q(d,j)$, by a simple aggregation over
all records for the given query. The ratio $a_q(d,j)/m_q(d,j)$
gives us the click through rate of document d in position j.
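
The aggregation described above can be sketched as follows. The record layout (query, ordered list of shown documents, clicked position or None) and the function name `aggregate_ctr` are assumptions made for illustration; the actual log schema is not specified in this section.

```python
from collections import defaultdict


def aggregate_ctr(records):
    """Return {(query, doc, position): clicks / impressions}.

    records: iterable of (query, shown_docs, clicked_position) where
    shown_docs is the ranked list and clicked_position is 1-based or None.
    This record layout is hypothetical.
    """
    clicks = defaultdict(int)       # a_q(d, j)
    impressions = defaultdict(int)  # m_q(d, j)
    for query, shown_docs, clicked_position in records:
        for j, doc in enumerate(shown_docs, start=1):
            impressions[(query, doc, j)] += 1
        if clicked_position is not None:
            doc = shown_docs[clicked_position - 1]
            clicks[(query, doc, clicked_position)] += 1
    return {key: clicks.get(key, 0) / m for key, m in impressions.items()}


# Four made-up impressions of a two-result page for one query.
log = [
    ("cnn", ["a", "b"], 1),
    ("cnn", ["a", "b"], 2),
    ("cnn", ["b", "a"], 1),
    ("cnn", ["a", "b"], None),
]
ctr = aggregate_ctr(log)
```

Because document "b" appears in both positions 1 and 2 in this toy log, it links the two positions, which is what later allows position effects to be separated from document effects.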

Our study extends the Examination Hypothesis (also referred
to as the Separability Hypothesis) proposed by Richardson
et al. [16] for ads and later used in the context of search
results [5] [6]. The examination hypothesis states that there
is a position dependent probability of examining a result.
Specifically, for a given query q,
the probability of clicking on a document d in position j
depends on the probability of examining the document in
the given position, $e_q(d,j)$, and the relevance of the document
to the given query, $g_q(d)$. It can be stated as

$c_q(d,j) = e_q(d,j)\,g_q(d)$ (1)

where $c_q(d,j)$ is the probability that an impression of document
d at position j is clicked. All prior works based on
this hypothesis assume that $e_q(d,j) = p(j)$, which depends
only on the position and is independent of the query and the
document. Note that $c_q(d,j)$ can also be viewed as the
click through rate on a document d in position j. Thus
$c_q(d,j)$ can be estimated from the click logs as
$c_q(d,j) = a_q(d,j)/m_q(d,j)$. We define the position bias, $p_q(d,j)$, as
the ratio of the probability of examining a document in position
j to the probability of examining it at position 1.

Definition 3.1 (Position Bias). For a given query q,
the position bias for a document d at position j is defined as
$p_q(d,j) = e_q(d,j)/e_q(d,1)$.

Next we define the goodness of a search result d for a 
query q as follows. 

Definition 3.2 (Goodness). We define the goodness 
(relevance) of document d, denoted by $g_q(d)$, to be the probability
that document d is clicked when shown in position 1
for query q, i.e., $g_q(d) = c_q(d,1)$.

Remark 3.3. Note that our definition of goodness only 
seems to measure the relevance of the search result snippet 
rather than the relevance of the document d. Although this is
merely a simplification in this study, ideally one needs to
combine click through information with other user behavior 
such as dwell time to capture the relevance of the document. 

The above definition of goodness removes the effect of the 
position from the click through rate of a document (snippet) 
and reflects the true relevance of a document that is inde- 
pendent of the position at which it is shown. Having defined 
the important concepts in our study, we will now state the 
basic assumption on which our model is based. 

Hypothesis 3.4 (Document Independence). The position
bias $p_q(d,j)$ depends only on the position j and query
q and is independent of the document d.

Therefore, we will drop the dependence on d and denote
the bias at position j as $p_q(j)$. Furthermore, by definition,
$p_q(1) = 1$ and each entry in the query log will give us the
equation

$c_q(d,j) = g_q(d)\,p_q(j)$. (2)

For a fixed query q, we will implicitly drop the q from the
subscript for convenience and use $c(d,j) = g(d)\,p(j)$.

We note that similar models based on the product of relevance
and position bias have been used in prior work [15] [5].
However, the main difference in our work is that the position
bias parameter $p(j)$ is allowed to depend on the query,
whereas earlier works assumed it to be a global constant
independent of the query.

4. LEARNING THE GOODNESS AND POSITION BIAS PARAMETERS

In this section we show how to compute the values $g(d)$
and $p(j)$ for a given query based on the above model. Note
that the different (document, position) pairs in the click log
associated with a given query give us a system of equations
$c(d,j) = g(d)\,p(j)$ that can be used to learn the latent variables
$g(d)$ and $p(j)$. The number of variables in
this system of equations is equal to the number of distinct
documents, say m, plus the number of distinct positions, say
n. We may be able to solve this system of equations for
the variables as long as the number of equations is at least
the number of variables. However, the number of equations
may exceed the number of variables, in which case the
system is over constrained. In such a case, we can solve for
the $g(d)$ and $p(j)$ that best fit our equations, so as
to minimize the cumulative error between the left and the
right sides of the equations under some norm. One
method to measure the error in the fit is to use the $L_2$-norm,
i.e., $\| c(d,j) - g(d)\,p(j) \|_2$. However, instead of looking at
the absolute difference as stated above, it is more appropriate
to look at the percentage difference, since the difference
between CTR values of 0.4 and 0.5 is not the same as the
difference between 0.001 and 0.1001. Taking logarithms, the
basic equation stated as Equation (2) can be rewritten as

$\log c(d,j) = \log g(d) + \log p(j)$. (3)

Let us denote $\log g(d)$, $\log p(j)$, and $\log c(d,j)$ by $g_d$, $p_j$, and
$c_{dj}$, respectively. Let $E$ denote the set of all query, document,
position combinations in the click log. We get the following
system of equations over the set of entries $E_q \subseteq E$ in the
click log for a given query.

$\forall (d,j) \in E_q: \quad g_d + p_j = c_{dj}$ (4)
$p_1 = 0$ (5)

We write this in matrix notation $Ax = b$, where
$x = (g_1, g_2, \ldots, g_m, p_1, p_2, \ldots, p_n)$ represents the goodness values
of the m documents and the position biases at all the n positions.
We solve for the best fit solution x that minimizes
$\|Ax - b\|_2^2 = p_1^2 + \sum_{(d,j) \in E_q} (g_d + p_j - c_{dj})^2$. The solution
is given by $x = (A'A)^{-1} A'b$.
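
As a rough illustration of Equations (4) and (5), the sketch below builds the matrix A for one query and solves the least squares problem. It assumes NumPy and uses `numpy.linalg.lstsq` rather than forming $(A'A)^{-1}A'b$ explicitly; the two agree when A has full column rank. The function name `fit_query`, the input layout, and the sample CTR values are hypothetical.

```python
import numpy as np


def fit_query(ctr, docs, n_positions):
    """Fit log-linear goodness and position bias for a single query.

    ctr: dict {(doc, position): observed CTR > 0}, positions 1-based.
    docs: list of distinct documents (fixes the variable ordering).
    Returns (goodness, bias) as arrays in the original (non-log) scale.
    """
    m = len(docs)
    doc_index = {d: i for i, d in enumerate(docs)}
    rows, rhs = [], []
    for (d, j), c in ctr.items():
        row = np.zeros(m + n_positions)
        row[doc_index[d]] = 1.0   # coefficient of g_d
        row[m + j - 1] = 1.0      # coefficient of p_j
        rows.append(row)
        rhs.append(np.log(c))     # c_dj = log c(d, j)
    anchor = np.zeros(m + n_positions)
    anchor[m] = 1.0               # Equation (5): p_1 = 0
    rows.append(anchor)
    rhs.append(0.0)
    x, *_ = np.linalg.lstsq(np.vstack(rows), np.array(rhs), rcond=None)
    return np.exp(x[:m]), np.exp(x[m:])


# Synthetic CTRs generated from g = (0.6, 0.3) and p = (1, 0.5, 0.25).
sample_ctr = {("a", 1): 0.6, ("a", 2): 0.3, ("b", 2): 0.15, ("b", 3): 0.075}
goodness, bias = fit_query(sample_ctr, ["a", "b"], 3)
```

In this toy input document "a" links positions 1 and 2 and document "b" links positions 2 and 3, so every unknown can be recovered.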

4.1 Invertibility of A'A and graph connectivity

Note that finding the best fit solution x requires that $A'A$
be invertible. To understand when $A'A$ is invertible, for a
given query we look at the bipartite graph B (see Figure 2)
with the m documents d on the left side and the n positions j on
the right side, and place an edge if the document d has appeared
in position j, which means that there is an equation
corresponding to $g_d$ and $p_j$ in Equation (4). We are essentially
deducing $g_d$ and $p_j$ values by looking at paths in this
bipartite graph that connect different positions and documents.
But if the graph is disconnected we cannot compare
documents or positions in different connected components.
Indeed we show that if this graph is disconnected then $A'A$
is not invertible and vice versa.

Claim 4.1. $A'A$ is invertible if and only if the underlying
graph B is connected.

Proof. If the graph is connected, A has full rank. This is
because, since $p(1) = 1$, we can solve for $g_d$ for all documents
that are adjacent to position 1 in graph B. Further,
whenever we have a value for a node, we can derive the values
of all its neighbors in B. Since the graph is connected,
Figure 2: The bipartite graph B of documents and
positions for a given query. The matrix $A'A$ is invertible
only if this graph is connected. Even if it is
disconnected, our model can be used within each
connected component.

every node is reachable from position 1. So A has full rank,
implying that $A'A$ has full rank and is therefore invertible.

If the graph is disconnected, consider any component which
does not contain position 1. We will argue that the system
of equations for this component is not full rank. This is
because $Ax = Ax'$ for a solution vector x with certain $g_d$ and $p_j$
values for nodes in the component, and the solution vector
$x'$ with values $g_d - \alpha$ and $p_j + \alpha$, for any $\alpha$. Therefore, A is
not full rank, as we can have many solutions with the same left
hand side, implying $A'A$ is not invertible. □
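
Claim 4.1 suggests a simple pre-check before solving: find the connected components of the document-position graph B. The sketch below uses a breadth-first search; the function name `connected_components` and the node tagging scheme are illustrative assumptions.

```python
from collections import defaultdict, deque


def connected_components(pairs):
    """Connected components of the bipartite document-position graph.

    pairs: iterable of (doc, position) edges observed for one query.
    Nodes are tagged as ("doc", d) or ("pos", j) so names cannot collide.
    """
    adjacency = defaultdict(set)
    for d, j in pairs:
        adjacency[("doc", d)].add(("pos", j))
        adjacency[("pos", j)].add(("doc", d))
    seen, components = set(), []
    for start in list(adjacency):
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


# Document "c" was only ever shown at position 3, so it is isolated
# from the component that contains position 1.
components = connected_components([("a", 1), ("a", 2), ("b", 2), ("c", 3)])
```

If more than one component is returned, $A'A$ is singular for the full system and the fit is only meaningful within each component.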

4.2 Handling disconnected graphs 

Even if the graph B is disconnected, we can still use 
the system of equations to compare the goodness and po- 
sition bias values within one connected component. This is 
achieved by measuring position bias values relative to the 
highest position within the component instead of position 1. 
To overcome the problem of disconnected graphs, we solve
for the solution that assumes that the average goodness in
the different connected components is about equal. This is
achieved by adding the following equations to our system:

$\forall d \in E_q: \quad \epsilon\,(g_d - \mu) = 0$ (6)

where $\mu$ is the average goodness of the documents for the
query and $\epsilon$ is a small constant that tends to 0. $\epsilon$ simply gives
a tiny weight to this system of equations, which essentially
says that the goodness values of all the documents are equal (to
$\mu$). If the bipartite graph is connected, these additional
equations make no difference to the solution as $\epsilon$ tends to
0. If the graph is disconnected, they combine the solutions
in each connected component in such a way as to make the
average goodness in all the components as equal as possible.
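
One way to realize Equation (6) is to append one low-weight row per document to the least squares system, with $\mu$ written as the mean of the $g_d$ variables. The sketch below assumes NumPy; the function name `fit_with_tie_breaker`, the choice `eps=1e-3`, and the sample values are assumptions for illustration, not settings reported in the text.

```python
import numpy as np


def fit_with_tie_breaker(ctr, docs, n_positions, eps=1e-3):
    """Least squares fit of Equations (4)-(6) for one query.

    ctr: dict {(doc, position): observed CTR > 0}, positions 1-based.
    eps: small weight on the rows g_d - mean(g) = 0 (hypothetical value).
    """
    m = len(docs)
    doc_index = {d: i for i, d in enumerate(docs)}
    rows, rhs = [], []
    for (d, j), c in ctr.items():
        row = np.zeros(m + n_positions)
        row[doc_index[d]] = 1.0
        row[m + j - 1] = 1.0
        rows.append(row)
        rhs.append(np.log(c))
    anchor = np.zeros(m + n_positions)
    anchor[m] = 1.0                      # p_1 = 0
    rows.append(anchor)
    rhs.append(0.0)
    for i in range(m):
        row = np.zeros(m + n_positions)
        row[:m] = -eps / m               # -eps * mean of all g_d
        row[i] += eps                    # +eps * g_d
        rows.append(row)
        rhs.append(0.0)
    x, *_ = np.linalg.lstsq(np.vstack(rows), np.array(rhs), rcond=None)
    return np.exp(x[:m]), np.exp(x[m:])


# Two disconnected components: "a" seen at positions 1-2, "b" at 3-4.
split_ctr = {("a", 1): 0.6, ("a", 2): 0.3, ("b", 3): 0.1, ("b", 4): 0.05}
goodness_tb, bias_tb = fit_with_tie_breaker(split_ctr, ["a", "b"], 4)
```

In this example the tie-breaking rows set the goodness of "b" equal to that of "a", which in turn fixes the otherwise undetermined biases at positions 3 and 4.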

4.3 Limitations of the Model 

We briefly describe some concerns that arise from our 
model and describe methods to address some of these con- 
cerns. 

• The Document Independence Hypothesis 3.4 may not
be true, as people may not examine lower positions depending
on whether they have already seen a good
result. Or they may not click on the next document
if it is similar to the previous one. We show a method
to measure the extent of validity of this hypothesis in
Section 5.1.

• Some of the connected components of the bipartite
graph may be small if only a limited amount of click data is
available.

• For any click based method, the coverage of rated documents
is small, as only clicked documents can be rated. In
a later section we show how to increase coverage by inferring
goodness values for unclicked documents through
clicks on similar queries.

5. EXPERIMENTAL EVALUATION 

In this section we analyze the relevance and position bias
values obtained by running our algorithm on click data from a
commercial search engine. Specifically, we adopt widely-used
measures such as relative error and perplexity to measure
the performance of our click prediction model. Throughout
this section, we will refer to our algorithm as QSEH, the
vanilla examination hypothesis as EH, and the user browsing
model as UBM. The UBM model was implemented using
Infer.NET [14]. We show that our model, although much
simpler to implement, outperforms EH and UBM.

Click data 

We consider a click log of a commercial search engine containing
queries with frequencies between 1000 and 100000
over a period of one month. We only considered entries in
the log where the number of impressions for a document in
a top-10 position is at least 100 and the number of clicks
is non-zero. The truncation is done in order to ensure that
$c_q(d,j)$ is a reasonable estimate of the click probability. The
above filtering resulted in a click log, call it Q, containing
2.03 million entries with 128,211 unique queries and 1.3
million distinct documents. One important characteristic
that affects the performance of our algorithm is the
query frequency. Table 1 summarizes the distribution of query
frequencies and the average size of the largest component in
each frequency range. It largely follows our intuition that
the more frequent queries are more likely to have a search
result shown in multiple positions, resulting in a larger component
size.

Out of the total 2.03 million entries, we sample around
85,000 (query, url, position) triples into the test set in such a
way that there is at least one entry for each unique query
in the log; the triples are biased towards urls with more
impressions. Let us denote the test set by T. This gives us
a training set S = Q \ T of around 1.9 million entries.

Clickthrough Rate Prediction 

We compute the relative error between the predicted and observed
clickthrough rates for each (q, d, j) triple in the test
set T to measure the performance of our algorithm. We compute
the relative error as $|c_q(d,j) - \hat{c}_q(d,j)| / c_q(d,j)$, where
$\hat{c}_q(d,j)$ is the predicted CTR from the model and $c_q(d,j)$ is
the actual CTR from the click logs. A good prediction will
result in a value close to zero, while a bad prediction will deviate
from zero. We present the relative error over all triples
in T as a cumulative distribution function in Figure 3. Such
a plot illustrates the fraction of queries that fall below
Table 1: Summary of the Click Data

Figure 3: Cumulative distribution function of the
relative error between the predicted and observed 
CTR values for QSEH, EH, and UBM. For a relative error 
of 25%, QSEH outperforms UBM by 10.6% and EH by 
6.34%. 



a certain relative error. For example, for a relative error of
25%, EH produces 48.3% of queries below this error, UBM results
in 46.12% of queries, while QSEH results in 51.57% of queries
below this error, an improvement of 10.6% over UBM and
6.34% over EH. As we can observe from the figure, while
EH outperforms UBM at smaller errors, the trend reverses at
larger errors. In Figure 4 we present the relative error in a
different way, keeping the sign of the error. This figure shows
that QSEH does much better in not over-predicting the CTRs
when compared to EH and UBM, while it does marginally better
than UBM when it comes to under-prediction. QSEH under-predicts
by an average of 48.6%, EH under-predicts by an average of
86.54%, and UBM under-predicts by an average of 78.09%.
The respective numbers in the case of over-prediction are
44.07%, 78.00%, and 48.95%.
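
The relative error measure and the cumulative fraction read off the CDF can be sketched as below. The sketch assumes the error is normalized by the observed CTR (which is non-zero after the filtering described earlier); the function names and the sample values are hypothetical.

```python
def relative_error(observed, predicted):
    """Relative error of a predicted CTR with respect to the observed CTR."""
    return abs(observed - predicted) / observed


def fraction_below(pairs, threshold):
    """Fraction of (observed, predicted) pairs with relative error below threshold."""
    errors = [relative_error(c, c_hat) for c, c_hat in pairs]
    return sum(e < threshold for e in errors) / len(errors)


# Made-up (observed, predicted) CTR pairs for four test triples.
test_pairs = [(0.2, 0.25), (0.1, 0.2), (0.4, 0.35), (0.5, 0.5)]
share = fraction_below(test_pairs, 0.3)
```

Evaluating `fraction_below` over a grid of thresholds gives one point of the cumulative distribution per threshold.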

In another set of experiments, we repeated the above ex- 
periment for queries bucketed according to their frequencies 
to study the effect of query frequency on the CTR predic- 
tion. In this experiment, we estimate average relative error 
over all test triples for queries in the frequency bucket. The 
effect of the query frequency is shown in Figure 5. As the
figure illustrates, in the case of QSEH, the relative error is 
Figure 4: Relative error for the test data for the
Query Specific EH, EH, and UBM methods.

Figure 5: The relative error between the predicted
and observed CTR values for QSEH, EH, and UBM for
different query frequencies. The average relative error
for EH is 39.33%, UBM is 43%, and is significantly
lower for QSEH at 29.19%.



stable across all query frequencies while it is higher for
both EH and UBM. We note that the stable trends in the figure
are for cases where there are a reasonable number of queries in
that particular frequency range. We can attribute the large
fluctuation in values for frequencies greater than 35000 to the
small number of queries in any of those frequency buckets (see
Table 1). Finally, we note that the average relative error for
EH is 39.33% and UBM is 43%, while it is significantly lower
for QSEH at 29.19%.

Another measure we use to test the effectiveness of our 
predicted CTRs is perplexity. We used the standard def- 
inition of perplexity for a test set V of query, document, 
position triples as 

where $c_q(d,j)$ is the observed CTR at position j for query
q and document d and $\hat{c}_q(d,j)$ is the predicted CTR. This
is essentially an exponential function of the cross entropy.
A small value of perplexity corresponds to a good prediction;
the value can be no smaller than 1.0. We computed the
Figure 6: Perplexity for different query frequencies
on the test data. The average perplexity values for
Query Specific EH, EH, and UBM are 1.0671, 1.0726, and
1.0693 respectively.

Figure 7: Perplexity at different positions on the
test data. The average perplexity values for Query
Specific EH, EH, and UBM are 1.1081, 1.1286, and 1.1211
respectively.



perplexity as a function of the position as well as the query 
frequency. In the former we group entries in T by position, 
and in the latter, we simply group by all the queries in a 
certain frequency range. Figures 6 and 7 illustrate the relative
performance of QSEH, EH, and UBM. For different query
frequencies, the average perplexity of QSEH is 1.0671. The 
corresponding values for EH and UBM are 1.0726 and 1.0693 
respectively. In the case of different positions, the corre- 
sponding values for QSEH, EH, and UBM are 1.1081, 1.1286, 
and 1.1211 respectively. 
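
A perplexity computation consistent with the description above (an exponential of the cross entropy, bounded below by 1) can be sketched as follows. The exact form is an assumption: the sketch uses the binary cross entropy form common in the click-model literature, $2^{-\frac{1}{|T|}\sum [c \log_2 \hat{c} + (1 - c)\log_2(1 - \hat{c})]}$, which may differ in detail from the definition used in the experiments. The function name `perplexity` is hypothetical.

```python
import math


def perplexity(pairs):
    """Perplexity of predicted CTRs against observed CTRs.

    pairs: list of (observed, predicted) with 0 < predicted < 1.
    Assumed form: 2 ** (average binary cross entropy in bits).
    """
    total = 0.0
    for c, c_hat in pairs:
        total += c * math.log2(c_hat) + (1.0 - c) * math.log2(1.0 - c_hat)
    return 2.0 ** (-total / len(pairs))


# An uninformative prediction of 0.5 against an observed CTR of 0.5.
baseline = perplexity([(0.5, 0.5)])
# A sharper, accurate prediction for a rarely clicked result.
sharper = perplexity([(0.05, 0.05)])
```

Grouping the test triples by position or by query frequency bucket before calling the function gives the per-position and per-frequency values.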

Understanding Patterns of position bias 

We also consider a subset of queries, labeled CC10, whose
largest component includes all positions 1 through 10; these
are queries where the bipartite graph B is a single connected
component. This dataset has 112,735 entries with
2,614 unique queries and 42,119 unique documents. We use
the position bias vectors derived for the connected components
in CC10 to study the trend of the position bias curves
over different queries. A navigational query will have small
$p(j)$ values for the lower positions and hence $p_j$ ($= \log p(j)$) values
that are large in magnitude. An informational query, on the
other hand, will have $p_j$ values that are smaller in magnitude.
For a given position bias vector p, we look at the entropy
$H(p) = -\sum_{j=1}^{10} \frac{p(j)}{|p|} \log \frac{p(j)}{|p|}$, where $|p|$ is the sum of
all the position bias values over all positions. The entropy
is likely to be low for navigational queries and high for informational
queries. We measured the distribution of $H(p)$
over all the 2500 queries in CC10 and divided these queries
into ten categories of 250 queries each, obtained by sorting
the $H(p)$ values in increasing order.

Figure 8: Position bias curves $p_j = \log p(j)$ obtained by
taking the median within each of the 10 categories. The categories
are obtained by sorting queries by sum(p).
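
The entropy of a position bias vector, as described above, can be computed with a few lines. The function name `position_bias_entropy` and the two sample vectors are illustrative assumptions; the input is taken to be the biases $p(j)$ themselves, not their logarithms.

```python
import math


def position_bias_entropy(p):
    """Entropy of a position bias vector after normalizing it to sum to 1."""
    total = sum(p)
    return -sum((x / total) * math.log(x / total) for x in p if x > 0)


# Hypothetical bias vectors: a sharp drop versus a slow decay.
navigational_like = [1.0, 0.01, 0.01]
informational_like = [1.0, 0.8, 0.6]
h_nav = position_bias_entropy(navigational_like)
h_info = position_bias_entropy(informational_like)
```

Sorting queries by this value and cutting the sorted list into equal-sized groups gives the kind of categorization used in the analysis.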

We then study the aggregate behavior of the position bias
curves within each of the ten categories. Figure 8 shows the
median value $m_p$ of the position bias curves p, taken at
each position over all queries in each category. Observe that
the median curves in the different categories have more or
less the same shape but different scales. It would be interesting
to explain all these curves as a single parametrized curve.
To this end, we scale each curve so that the median log position
bias $m_{p,6}$ at the middle position 6 is set to $-1$. Essentially
we are computing $\mathrm{normalized}(m_p) = -m_p / m_{p,6}$. The
$\mathrm{normalized}(m_p)$ curves over the ten categories are shown in
Figure 9. From this figure it is apparent that the median
position bias curves in the ten categories are approximately
scaled versions of each other (except for the one in
the first category). The different curves in Figure 9 can be
approximated by a single curve by taking their median; this
yields the vector $\Lambda$ = (0, -0.2952, -0.4935, -0.6792,
-0.8673, -1.0000, -1.1100, -1.1939, -1.2284, -1.1818). The
aggregate position bias curves in the different categories can
be approximated by the parametrized curve $\alpha\Lambda$.

Such a parametrized curve can be used to approximate
the position bias vector for any query. The value of $\alpha$
determines the extent to which the query is navigational or
informational. Thus the value of $\alpha$, obtained by computing
the best fit parameter value that approximates the position
bias curve for a query, can be used to classify the query as
informational or navigational. Given a (log) position bias
vector $\hat{p}$, the best fit value of $\alpha$ is obtained by minimizing
$\| \hat{p} - \alpha A \|_2$, which results in $\alpha = A'\hat{p} / A'A$. Table [2] shows
some of the queries in CC10 with high and low values
of $e^{-\alpha}$. Note that $e^{-\alpha}$ corresponds to the position bias at
position 6 (since $p(6) = e^{-\alpha}$) as per the parametrized curve $\alpha A$.
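The least-squares fit of the scale parameter has a closed form, so it is easy to sketch. The vector `A` below is the one reported in the text; the function names and the example input are hypothetical.

```python
import math

# Median normalized log position bias curve A, as reported in the text.
A = [0, -0.2952, -0.4935, -0.6792, -0.8673, -1.0000,
     -1.1100, -1.1939, -1.2284, -1.1818]

def fit_alpha(log_bias, curve=A):
    """Scale alpha minimizing ||log_bias - alpha * curve||_2, i.e. A'p / A'A."""
    numerator = sum(a * p for a, p in zip(curve, log_bias))
    denominator = sum(a * a for a in curve)
    return numerator / denominator

def bias_at_position_6(alpha):
    """Position bias at position 6 implied by the curve alpha * A (A[5] = -1)."""
    return math.exp(-alpha)

# Hypothetical query whose log position bias is exactly 2 * A.
example_log_bias = [2.0 * a for a in A]
alpha = fit_alpha(example_log_bias)
```

A large fitted value gives a small bias at position 6, which suggests a navigational query; a value near zero suggests an informational one.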

5.1 Testing the Document Independence Hypothesis
Figure 9: The best fit curve that describes the general
shape of the position bias curve. This is obtained by
taking the median of the normalized position bias curves
over the 10 categories.

Query              $e^{-\alpha}$    Query                    $e^{-\alpha}$
yahoofinance       0.0001    writing desks            0.2919
ziprealty          0.0002    sports injuries          0.4250
tonight show       0.0004    foreign exchange rates   0.7907
winzip             0.015     dental insurance         0.7944
types of snakes    0.1265    sfo                      0.8614
ram memory         0.127     brain tumor symptoms     0.9261

Table 2: $e^{-\alpha}$ for a few queries.


Recall that our model is based on the Document Independence
Hypothesis [3.4], that is, $p_q(d,j)$ is independent of $d$. In
this section we show a simple method to test this hypothesis
from the click data.

To test the hypothesis we look at the bipartite graph $B$
for a query, with documents on one side and positions on the
other, where each edge $(d,j)$ is labeled by $\hat{c}_{dj}$. We show that
cycles in this graph (see Figure [10]) must satisfy a special
property.

For each edge $(d,j)$ in this graph, we have a $c(d,j)$ obtained
from the query log. Let $C = (d_1, j_1, d_2, j_2, d_3, \ldots, d_k, j_k, d_1)$
denote a cycle in this graph with alternating edges between
documents $d_1, d_2, \ldots, d_k$ and positions $j_1, j_2, \ldots, j_k$ and
connecting back at node $d_1$. We now show that our hypothesis
implies that the sums of the $\hat{c}_{dj}$ values ($\hat{c}_{dj} = \log c(d,j)$) on
the odd and even edges of the cycle are equal. This gives us
a simple test for our hypothesis by computing the sum for
different cycles.

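Claim 5.1. Given a cycle $C = (d_1, j_1, d_2, j_2, d_3, \ldots, d_k, j_k, d_1)$,
our Independence Hypothesis [3.4] implies that
$sum(C) = \sum_{i=1}^{k} \hat{c}_{d_i j_i} - \sum_{i=1}^{k} \hat{c}_{d_{i+1} j_i} = 0$
(where $d_{k+1}$ is the same as $d_1$ for convenience).

Proof. We need to show that $\sum_{i=1}^{k} \hat{c}_{d_i j_i} = \sum_{i=1}^{k} \hat{c}_{d_{i+1} j_i}$.

Note that $\hat{c}_{dj} = \hat{g}_d + \hat{p}_j$. So $\sum_{i=1}^{k} \hat{c}_{d_i j_i} = \sum_{i=1}^{k} (\hat{g}_{d_i} + \hat{p}_{j_i})$.
Similarly $\sum_{i=1}^{k} \hat{c}_{d_{i+1} j_i} = \sum_{i=1}^{k} (\hat{g}_{d_{i+1}} + \hat{p}_{j_i}) = \sum_{i=1}^{k} (\hat{g}_{d_i} + \hat{p}_{j_i})$
(since $d_{k+1} = d_1$). □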

Clearly in practice we do not expect $sum(C)$ to be exactly
0. In fact longer cycles are likely to have a larger deviation
from 0. To normalize this we consider
$ratio(C) = sum(C) / \sqrt{\sum_{(d,j) \in C} \hat{c}_{dj}^2}$. The denominator is essentially
$\|C\|_2$ where $C$ is viewed as a vector of $\hat{c}_{dj}$ values associated
with the edges in the cycle. The number of dimensions
of the vector is equal to the length of the cycle. So
$ratio(C) = sum(C) / \|C\|_2$ simply normalizes $sum(C)$
by the length of the vector $C$. It can be easily shown
theoretically that for a random vector $C$ of length $\|C\|_2$ in a high
dimensional Euclidean space, the root mean squared value of
$|ratio(C)| = |sum(C)| / \|C\|_2$ is equal to 1. Thus, a value
of $|ratio(C)|$ much smaller than 1 indicates that $|sum(C)|$
is biased towards smaller values. This gives us a method to
test the validity of the Document Independence Hypothesis
by measuring $|sum(C)|$ and $|ratio(C)|$ for different cycles $C$.

Figure 10: A cycle $C$ in the bipartite graph of documents
and positions for a given query. To test the Document
Independence Hypothesis, check if
$sum(C) = \sum_{(d,j) \in \text{odd edges of } C} \hat{c}_{dj} - \sum_{(d,j) \in \text{even edges of } C} \hat{c}_{dj} = 0$.
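The cycle test can be sketched as follows. This is an illustration under assumptions: the dictionary layout, the function name, and the toy goodness and bias values are hypothetical, and the log CTRs are synthesized from the product model rather than read from a query log.

```python
import math

def cycle_sum_and_ratio(log_ctr, cycle):
    """Return (sum(C), ratio(C)) for a cycle in the document-position graph.

    log_ctr maps (document, position) to log c(d, j).
    cycle is [d1, j1, d2, j2, ..., dk, jk]; the return to d1 is implicit.
    """
    docs = cycle[0::2]
    positions = cycle[1::2]
    k = len(docs)
    # Odd edges are (d_i, j_i); even edges are (d_{i+1}, j_i), with d_{k+1} = d_1.
    odd = [log_ctr[(docs[i], positions[i])] for i in range(k)]
    even = [log_ctr[(docs[(i + 1) % k], positions[i])] for i in range(k)]
    total = sum(odd) - sum(even)
    norm = math.sqrt(sum(v * v for v in odd + even))
    return total, total / norm

# Toy data that satisfies c(d, j) = g(d) * p(j) exactly (hypothetical values).
goodness = {"a": 0.4, "b": 0.1}
bias = {1: 1.0, 2: 0.5}
log_ctr = {(d, j): math.log(g * p)
           for d, g in goodness.items() for j, p in bias.items()}
exact_sum, exact_ratio = cycle_sum_and_ratio(log_ctr, ["a", 1, "b", 2])

# Perturbing one edge breaks the product form and shifts the sum away from 0.
noisy = dict(log_ctr)
noisy[("a", 1)] += 0.3
noisy_sum, noisy_ratio = cycle_sum_and_ratio(noisy, ["a", 1, "b", 2])
```

When the product form holds exactly the sum vanishes up to floating point error; the perturbed copy shows how a single inconsistent edge appears directly in the sum.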

We measured the quantities $|sum(C)|$ and $|ratio(C)|$
computed over different cycles $C$ in the bipartite graphs of
documents and positions over different queries. We found a total
of 218,143 cycles of lengths ranging from 4 to 20. Note that
since the graph is bipartite, the smallest possible cycle
length is 4 and all cycles must be of even length. Figure [11]
shows the distribution of the length of the different cycles.

For each cycle $C = (d_1, j_1, d_2, j_2, d_3, \ldots, d_k, j_k, d_1)$, we
compute the quantity $|sum(C)|$ as described in Claim 5.1.
Figure [12] shows the distribution of $|sum(C)|$. We also plot
$|ratio(C)|$ in Figure [13].

As can be seen from Figure [12], the median value of $|sum(C)|$
is bounded by about 1, and from Figure [13] the median value
of $|ratio(C)|$ is less than 0.1 for all cycle lengths. While
the median value of $|sum(C)|$ leaves the validity of the
Document Independence Hypothesis inconclusive, the small value
of $|ratio(C)|$ can be viewed as mild evidence in favor of the
hypothesis.

6. USING RELATED QUERIES TO INCREASE 
COVERAGE 

In addition to finding their use in predicting CTRs, the 
goodness values obtained from our model can be employed in 
designing effective search quality metrics that are very well 
aligned with user satisfaction. In this section, we will present 
a method to infer the goodness values of documents that are 
not directly associated with a given query (via clicks) and 
then illustrate the use of these inferred values in computing
a click-based feature for ranking search results. 

One of the primary drawbacks of any click-based approach
is the sparsity of the underlying data, as a large number
of documents are never clicked for a query. We present a
method to extend the goodness scores for a query to a larger
set of documents. We may be able to infer the goodness of
more documents for a query by looking at similar queries.
Assuming we have access to a query similarity matrix $S$, we
may infer new goodness values $L_{dq}$ as

$$L_{dq} = \sum_{q'} S_{qq'} G_{dq'},$$

where $S_{qq'}$ denotes the similarity between queries $q$ and $q'$.
This essentially accumulates goodness values from similar
queries by weighting them with their similarity values.
Writing this in matrix form gives $L = SG$. The question
then is how to obtain the similarity matrix $S$.

Figure 11: Distribution of the lengths of the cycles
in the bipartite graphs for the queries.

Figure 12: Distribution of $|sum(C)|$ over cycles of
different length.

One method to compute $S$ is to consider two queries to be
similar if they share a lot of good documents. This can be
obtained by taking the dot product of the goodness vectors
spanning the documents for the two queries. This operation
can be represented in matrix form as $S = GG'$. Another
way to visualize this is to look at a complete bipartite graph
with queries on the left and documents on the right with the
goodness values on the edges of the graph. $GG'$ is obtained
by first looking at all paths of length 2 between two queries
and then adding up the product of the goodness values on
the edges over all the 2-length paths between the queries.

A generalization of this similarity matrix is obtained by
looking at paths of longer length, say $l$, and adding up the
product of the goodness values along such paths between
two queries. This corresponds to the similarity matrix
$S = (GG')^l$. The new goodness values based on this similarity
matrix are given by $L = (GG')^l G$. We only use non-zero
entries in $L$ as valid ratings.

Figure 13: Distribution of $|ratio(C)|$ over cycles of
different length. The fact that the value of $|ratio(C)|$
is much less than 1 can be viewed as evidence for
the Document Independence Hypothesis.
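The propagation step can be sketched with plain nested lists. The sketch assumes that $G$ is stored with one row per query and one column per document, so that $GG'$ is a query-by-query matrix; the helper names and the toy matrix are hypothetical.

```python
def transpose(X):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*X)]

def matmul(X, Y):
    """Plain matrix product of two list-of-rows matrices."""
    Y_cols = transpose(Y)
    return [[sum(x * y for x, y in zip(row, col)) for col in Y_cols]
            for row in X]

def extend_goodness(G, l=1):
    """Return L = (G G')^l G for a query-by-document goodness matrix G."""
    similarity = matmul(G, transpose(G))  # S = G G', query-by-query
    L = G
    for _ in range(l):
        L = matmul(similarity, L)  # apply S once per step
    return L

# Toy example: two queries, three documents (hypothetical goodness values).
# Query 0 has no click data for document 2; query 1 has none for document 0.
G = [[1.0, 0.5, 0.0],
     [0.0, 0.5, 1.0]]
L1 = extend_goodness(G, l=1)
```

In the toy example the two queries share document 1, so each query inherits a non-zero goodness value for the document it had no clicks on.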

Relevance Metrics 

We measure the effectiveness of our algorithm by comparing
the ranking produced when ordering documents for a query
based on their relevance values to human judgments. We
quantify the effectiveness of our ranking algorithm using
three well known measures: NDCG, MRR, and MAP. We
refer the reader to [19] for an exposition on these measures.
Each of these measures can be computed at different rank
thresholds T and is specified by NDCG@T, MAP@T, and
MRR@T. In this study we set T = 1, 3, 10.

The normalized discounted cumulative gains (NDCG) measure
[] discounts the contribution of a document to the overall
score as the document's rank increases (assuming that
the most relevant document has the lowest rank). Higher
NDCG values correspond to better correlation with human
judgments. Given a ranked result set $Q$, the NDCG at a
particular rank threshold $k$ is defined as

$$NDCG(Q, k) = Z_k \sum_{j=1}^{k} \frac{2^{r(j)} - 1}{\log(1 + j)},$$

where $r(j)$ is the (human judged) rating (0=bad, 2=fair,
3=good, 4=excellent, and 5=definitive) at rank $j$ and
$Z_k$ is a normalization factor calculated to make the perfect
ranking at $k$ have an NDCG value of 1.
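A minimal sketch of NDCG at a rank threshold follows. It assumes the commonly used gain $2^{r(j)} - 1$ and discount $1/\log(1 + j)$, and it takes the perfect ranking to be the same ratings sorted in decreasing order; both choices and the function names are assumptions made for illustration.

```python
import math

def dcg_at_k(ratings, k):
    """Discounted cumulative gain of the first k ratings (rank j starts at 1)."""
    return sum((2 ** r - 1) / math.log(1 + j)
               for j, r in enumerate(ratings[:k], start=1))

def ndcg_at_k(ratings, k):
    """DCG divided by the DCG of the same ratings in the ideal (sorted) order."""
    ideal = dcg_at_k(sorted(ratings, reverse=True), k)
    if ideal == 0:
        return 0.0  # no rated-relevant document: define NDCG as 0
    return dcg_at_k(ratings, k) / ideal

# Hypothetical human ratings listed in ranked order.
perfect = ndcg_at_k([5, 3, 0], 3)
reversed_order = ndcg_at_k([0, 3, 5], 3)
```

The normalization by the ideal ordering plays the role of $Z_k$: a ranking that already lists ratings in decreasing order scores exactly 1.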

The reciprocal rank (RR) is the inverse of the position of
the first relevant document in the ordering. In the presence
of a rank threshold T, this value is 0 if there is no relevant
document in positions below this threshold. The mean
reciprocal rank (MRR) of a query set is the average reciprocal
rank of all queries in the query set.

The average precision of a set of documents is defined as
$\frac{\sum_{i=1}^{n} Relevance(i)/i}{\sum_{i=1}^{n} Relevance(i)}$, where $i$ is the position of the
documents in the range [1, 10], and $Relevance(i)$ denotes the
relevance of the document in position $i$. Typically, we use a
binary value for $Relevance(i)$ by setting it to 1 if the
document in position $i$ has a human rating of fair or more and
0 otherwise. The mean average precision (MAP) of a query
set is the mean of the average precisions of all queries in the
query set.

Figure 14: Since many relevant documents for a
query may have no click data, we infer goodness
values by using similar queries. NDCG scores after
thus increasing coverage with $l = 1$ and 2
respectively.
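The two remaining measures can be sketched as below. The sketch assumes binary relevance labels and assumes that average precision is the sum of Relevance(i)/i divided by the sum of Relevance(i) over the considered positions; this may differ from the precision-at-rank form used elsewhere. All function names are hypothetical.

```python
def reciprocal_rank(relevance, threshold):
    """1 / position of the first relevant document within the threshold, else 0."""
    for i, rel in enumerate(relevance[:threshold], start=1):
        if rel:
            return 1.0 / i
    return 0.0

def average_precision(relevance, threshold=10):
    """Sum of Relevance(i)/i divided by sum of Relevance(i) (assumed form)."""
    top = relevance[:threshold]
    relevant_total = sum(top)
    if relevant_total == 0:
        return 0.0  # no relevant document in range
    weighted = sum(rel / i for i, rel in enumerate(top, start=1))
    return weighted / relevant_total

def mean(values):
    """Arithmetic mean, used for MRR and MAP over a query set."""
    values = list(values)
    return sum(values) / len(values)

# Hypothetical binary relevance lists for two queries, in ranked order.
query_results = [[0, 0, 1], [1, 0, 1]]
mrr_at_3 = mean(reciprocal_rank(r, 3) for r in query_results)
map_at_3 = mean(average_precision(r, 3) for r in query_results)
```

Averaging the per-query values over the query set gives MRR and MAP at the chosen threshold.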

Isolated Ranking Features 

One way to test the efficacy of a feature is to measure the ef- 
fectiveness of the ordering produced by using the feature as 
a ranking function. This is done by computing the resulting 
NDCG of the ordering and comparing with the NDCG val- 
ues of other ranking features. Two commonly used ranking 
features in search engines are BM25F [18] and PageRank. 
BM25F is a content-based feature while PageRank is a link- 
based ranking feature. BM25F is a variant of BM25 that 
combines the different textual fields of a document, namely 
title, body and anchor text. This model has been shown to 
be one of the best-performing web search scoring functions 
over the last few years [19, 14]. To get a control run, we also
include a random ordering of the result set as a ranking and 
compare the performance of the three ranking features with 
the control run. 

We compute the NDCG scores for this algorithm. We
start with a goodness matrix $G$ with $\gamma = 0.6$ containing
936,606 non-zero entries. Figure [14] shows the NDCG scores
with the parameter $l$ set to 1 and 2 respectively. The number
of non-zero entries increases to over 7.1 million for $l = 1$ and
over 42 million for $l = 2$. However, the number of judged
<query, document> pairs only increases from 74,781 for $l = 1$
to 87,235 for $l = 2$. This implies that most of the documents
added by extending to paths of length 2 are not judged, which
results in the high NDCG scores for the Random ordering. If we
were to judge all these 'holes' in the ratings, we expect that
we would see a lower NDCG score for the Random ordering.

7. CONCLUSIONS 

In this paper, we presented a model based on a generalization
of the Examination Hypothesis that states that for
a given query, the user click probability on a document in a
given position is proportional to the product of the relevance
of the document and a query specific position bias. Based on
this model we learn the relevance and position bias parameters
for different queries and documents. We do this by translating
the model into a system of linear equations that can be solved to
obtain the best fit relevance and position bias values. We use
the obtained bias curves and the relevance values to predict
the CTRs given a query, URL, and position. We measure
the performance of our algorithm using well-used metrics like
log-likelihood and perplexity and compare the performance
with other techniques like the plain examination hypothesis
and the user browsing model.

Further, we performed a cumulative analysis of the po- 
sition bias curves for different queries to understand the 
nature of these curves for navigational and informational 
queries. In particular, we computed the position bias pa- 
rameter values for a large number of queries and found that 
the magnitude of the position bias parameter value indi- 
cates whether the query is informational or navigational. 
We also propose a method to solve the problem of dealing 
with sparse click data by inferring the goodness of unclicked 
documents for a given query from the clicks associated with 
similar queries. 